{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion: when to use each NA imputation method\n",
    "\n",
    "\n",
    "###  Which missing value imputation method shall I use and when?\n",
    "\n",
    "There is no straight forward answer to this question, and which method to use on which occasion is not set on stone. This is totally up to you. \n",
    "\n",
    "Different methods make different assumptions and have different advantages and disadvantages (see lecture \"Overview of missing value imputation methods\".\n",
    "\n",
    "**As a guideline I would say**:\n",
    "\n",
    "If missing values are less than 5% of the variable:\n",
    "replace by mean/median or random sample\n",
    "replace by most frequent category\n",
    "If missing values are more than 5% of the variable:\n",
    "do mean/median imputation+adding an additional binary variable to capture missingness\n",
    "add a 'Missing' label in categorical variables  \n",
    "If the number of NA in a variable is small, they are unlikely to have a strong impact on the variable / target that you are trying to predict. Therefore, treating them specially, will most certainly add noise to the variables. Therefore, it is more useful to replace by mean/random sample to preserve the variable distribution.\n",
    "\n",
    "If the variable / target you are trying to predict is however highly unbalanced, then it might be the case that this small number of NA are indeed informative. You would have to check this out.\n",
    "\n",
    "**Exceptions to the guideline**:\n",
    "\n",
    "If you / your company suspect that NA are not missing at random and do not want to attribute the most common occurrence to NA.\n",
    "If you don't want to increase the feature space by adding an additional variable to indicate missingness\n",
    "In these cases, replace by a value at the far end of the distribution or an arbitrary value.\n",
    "\n",
    "#### Final note\n",
    "\n",
    "NA imputation for data competitions and business settings can be approached differently. In data competitions, a tiny increase in performance can be the difference between 1st or 2nd place. Therefore, you may want to try all the feature engineering methods and use the one that gives the best machine learning model performance. It may be the case that different NA imputation methods help different models make better predictions.\n",
    "\n",
    "In business scenarios, scientist don't usually have the time to do lengthy studies, and may therefore choose to streamline the feature engineering procedure. In these cases, it is common practice to follow the guidelines above, taking into account the exceptions, and do the same processing for all features.\n",
    "\n",
    "This streamlined pre-processing may not lead to the most predictive features possible, yet it makes feature engineering and machine learning models delivery substantially faster. Thus, the business can start enjoying the power of machine learning sooner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
